Before we can begin any script we first need to make sure that the required packages are installed in our version of RStudio. Next, we can load the required packages to be used in the script. The code block below will do this for you.
# Check if packages are installed, if not install.
if(!require(here)) install.packages('here') #checks if a package is installed and installs it if required.
if(!require(tidyverse)) install.packages('tidyverse')
if(!require(ggplot2)) install.packages('ggplot2')
library(here) #loads in the specified package
library(tidyverse)
library(ggplot2)
You should be able to see that we have installed and loaded 3
different packages. Let’s first go over the basics of what a
package is. In its simplest terms, a package is a
toolbox that someone has created for us in R that makes
our life easier. These packages build on the basic code that
comes with the R programming language (what RStudio
uses to run), called base R.
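As an illustration, the install-then-load pattern from the code block above can also be written as a small loop. This is a minimal sketch rather than the approach required for this course; the helper name ensure_packages is hypothetical, and it relies only on the base R functions requireNamespace(), install.packages() and library().

```r
# Hypothetical helper (not part of any package): install any packages that
# are not yet installed, then attach all of them.
# Returns (invisibly) the names of the packages that had to be installed.
ensure_packages <- function(pkgs) {
  # requireNamespace() returns FALSE when a package is not installed
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  not_installed <- pkgs[!installed]
  if (length(not_installed) > 0) {
    install.packages(not_installed)
  }
  for (pkg in pkgs) {
    library(pkg, character.only = TRUE) # attach the package by name
  }
  invisible(not_installed)
}

# Example with packages that ship with R, so nothing needs downloading
missing_pkgs <- ensure_packages(c("stats", "utils"))
```

For this tutorial the equivalent call would be ensure_packages(c("here", "tidyverse", "ggplot2")).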
Figure 1.0.1: Opening an R package
It is always a good idea to check the documentation
for a package before you use it. We can do this by using the
help syntax, which is the ?. The package
we are trying to get help with is called here. Try to run
this code by clicking on the green arrow on the corner of the
code block on the left side of your screen. This will open a
webpage that tells us the purpose of the
here package and how it works.
Fig 1.0.2: Running code in R
?here #? loads the documentation for a specified package.
Fill in the code block below by putting in the
help syntax ? and the name of the package
you are interested in. This will get the documentation for the
other packages we are using. You can do this by
substituting in the packages from 1.0.1. Have a read of
each of these pages and click on any links you find interesting. These
are the main packages we will be using throughout this
course.
# Your code goes here!
The dataset we are using has already been downloaded into the folder containing this R Markdown file (Aside: an R Markdown file is basically a file that can contain both plain text and code). On your computer, navigate to this folder and have a look at what it contains.
You should note that it contains the following:
These are the key ingredients needed to organise all projects in R.
Fig 1.1.0: Tutorial 2 folder
You will notice that the data for today, called
PSYC2001_social-media-data.csv, is a CSV file
(short for Comma-Separated Values). This means that we
will need to import the dataset using a function
capable of importing CSV files.
We will be using two different functions to achieve
this. The read.csv() function is used to import our CSV
dataset and it comes from the utils package which is part
of base R. But the read.csv() function needs
to know where the file is coming from. To do this, we use the
here() function from the here package. This
function tells R the location of the project we are
working from, to make locating the data easier.
social_media <- read.csv(file = here("Data","PSYC2001_social-media-data.csv")) #reads in CSV files
Our data should now be imported into R! (If you have an error, something has gone wrong—please ask your tutor for help!)
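To see how a file path is built and checked before importing, here is a small self-contained sketch. It uses a temporary folder and base R's file.path() as a stand-in for here(), and the file name toy.csv and its contents are made up for illustration.

```r
# Create a throwaway project folder with a Data subfolder (illustration only)
project_dir <- tempfile("project_")
dir.create(file.path(project_dir, "Data"), recursive = TRUE)

# Write a tiny made-up CSV file into the Data folder
toy <- data.frame(id = c("S1", "S2"), age = c(15.2, 16.0))
csv_path <- file.path(project_dir, "Data", "toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Check that the file exists before importing it
if (file.exists(csv_path)) {
  toy_imported <- read.csv(file = csv_path)
} else {
  stop("File not found: ", csv_path)
}

print(toy_imported)
```

With the here package loaded, here("Data", "PSYC2001_social-media-data.csv") plays the same role as file.path() does in this sketch, but it starts from the project folder.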
The first thing we should do whenever we import data is to see how it looks in RStudio. There are a couple of ways to do this.
Fig 1.1.1: Navigating to dataset
# Method 1 - Type in the name of the object
social_media
# Method 2 - Use the View function
View(social_media) #View opens the dataset in a new tab.
# Method 3 - Use the head function
head(social_media) #head displays the first 6 rows of the dataset.
## id age time_on_social urban good_mood_likes bad_mood_likes followers
## 1 S1 15.2 3.06 1 22.8 46.5 173.3
## 2 S2 16.0 2.18 1 46.0 48.3 144.3
## 3 S3 16.8 1.92 1 50.8 46.1 76.5
## 4 S4 15.6 2.61 1 29.9 29.2 171.7
## 5 S5 17.1 3.24 1 37.1 52.4 109.5
## 6 S6 15.7 2.44 1 26.9 20.2 157.5
## polit_informed polit_campaign polit_activism
## 1 2.3 3.2 3.6
## 2 1.6 2.2 2.6
## 3 1.9 2.7 3.0
## 4 1.6 2.3 2.6
## 5 2.0 2.9 3.3
## 6 2.4 3.4 3.9
# Method 4 - Use the str function
str(social_media) #displays an overall summary of the object and variable structure.
## 'data.frame': 60 obs. of 10 variables:
## $ id : chr "S1" "S2" "S3" "S4" ...
## $ age : num 15.2 16 16.8 15.6 17.1 15.7 19.7 18.6 19.6 15.5 ...
## $ time_on_social : num 3.06 2.18 1.92 2.61 3.24 2.44 1.46 1.52 1.92 2.1 ...
## $ urban : int 1 1 1 1 1 1 1 1 1 1 ...
## $ good_mood_likes: num 22.8 46 50.8 29.9 37.1 26.9 14.8 26 6.5 45.7 ...
## $ bad_mood_likes : num 46.5 48.3 46.1 29.2 52.4 20.2 35.1 35.8 12.2 32.8 ...
## $ followers : num 173.3 144.3 76.5 171.7 109.5 ...
## $ polit_informed : num 2.3 1.6 1.9 1.6 2 2.4 1.7 1.6 1.5 2.2 ...
## $ polit_campaign : num 3.2 2.2 2.7 2.3 2.9 3.4 2.4 2.2 2.1 3.1 ...
## $ polit_activism : num 3.6 2.6 3 2.6 3.3 3.9 2.7 2.6 2.4 3.5 ...
You should now have a good idea of what
PSYC2001_social-media-data.csv looks like in RStudio.
You will also notice that the last function, str(),
displays a summary of the object. This includes the type of object (a data.frame), its size (60 observations of 10 variables), and the type of each variable: chr for id, int for urban, and num for all other variables.
Please discuss with your deskmate and tutor what you think chr and num mean.
Figure 1.1.2: You thinking
Once we have imported our dataset into R, it’s important to check the
quality and structure of the data to ensure everything looks as
expected. One simple way to do this is by using the
summary() function.
summary(social_media) #summary provides a quick overview of the data in each variable.
## id age time_on_social urban
## Length:60 Min. :13.90 Min. :-999.000 Min. :1.0
## Class :character 1st Qu.:15.70 1st Qu.: 1.920 1st Qu.:1.0
## Mode :character Median :16.50 Median : 2.365 Median :1.5
## Mean :16.87 Mean : -30.845 Mean :1.5
## 3rd Qu.:17.43 3rd Qu.: 3.042 3rd Qu.:2.0
## Max. :23.00 Max. : 4.320 Max. :2.0
## good_mood_likes bad_mood_likes followers polit_informed
## Min. : 6.50 Min. :12.20 Min. : 61.40 Min. :0.600
## 1st Qu.:31.60 1st Qu.:39.08 1st Qu.: 76.47 1st Qu.:1.500
## Median :45.90 Median :49.30 Median :116.30 Median :1.800
## Mean :43.04 Mean :49.84 Mean :124.76 Mean :1.858
## 3rd Qu.:53.40 3rd Qu.:58.75 3rd Qu.:153.75 3rd Qu.:2.200
## Max. :89.20 Max. :91.20 Max. :336.50 Max. :3.400
## polit_campaign polit_activism
## Min. :0.800 Min. :0.900
## 1st Qu.:2.100 1st Qu.:2.400
## Median :2.550 Median :2.900
## Mean :2.602 Mean :2.977
## 3rd Qu.:3.100 3rd Qu.:3.500
## Max. :4.800 Max. :5.500
Do you notice anything unusual in the output of this data? (Hint:
take a closer look at the time_on_social variable.)
Discuss anything that looks unusual with your deskmate and your tutor.
It should now be clear that this data is unusual because it has a
minimum value of -999 in the
time_on_social variable, which is measured in hours (we can’t have
negative time!).
Fig 1.2.1: Back to the future !
A good question to ask now is - why are these values in the dataset?
Sometimes when collecting data, we can’t get a response from every
participant. Instead of leaving a blank, researchers will sometimes put
in a placeholder value like -999 to show that the data is
missing.
These aren’t real numbers; they just mean the data wasn’t recorded.
But -999 isn’t the standard way to show missing data in R.
R uses NA to represent missing values, and that’s important
because most R functions know how to handle NA properly—but
they don’t know to ignore -999.
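A short sketch shows why the placeholder matters. The numbers below are made up for illustration and are not taken from the dataset.

```r
# Made-up hours on social media; -999 marks a missing response
hours_placeholder <- c(2.5, 3.0, -999, 1.5)

# R treats -999 as a real number, so the mean is badly distorted
mean(hours_placeholder) # -248

# The same values with the missing response recorded as NA
hours_na <- c(2.5, 3.0, NA, 1.5)

# is.na() identifies which values are missing
is.na(hours_na)

# Counting the missing values
sum(is.na(hours_na)) # 1
```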
Let’s clean this up by swapping all the -999 values in
the time_on_social column for NA. We can do this using the
mutate() and na_if() functions from the dplyr package, which is
loaded as part of the tidyverse.
social_media_NA <- social_media %>%
  mutate(time_on_social = na_if(time_on_social, -999)) #mutate creates or modifies columns.
  #na_if replaces -999 with NA.
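To check that a replacement like this worked, one option is to count the missing values before and after. The sketch below uses a made-up data frame and base R indexing in place of mutate() and na_if(), so it runs without any packages; the object names toy and toy_NA are hypothetical.

```r
# Made-up data frame with -999 used as a missing-value placeholder
toy <- data.frame(
  id = c("S1", "S2", "S3", "S4"),
  time_on_social = c(3.06, -999, 1.92, -999)
)

# Before cleaning: no NA values, but two placeholders
sum(is.na(toy$time_on_social))
sum(toy$time_on_social == -999)

# Base R alternative to na_if(): overwrite the placeholders with NA
toy_NA <- toy
toy_NA$time_on_social[toy_NA$time_on_social == -999] <- NA

# After cleaning: two NA values and no placeholders left
sum(is.na(toy_NA$time_on_social))
sum(toy_NA$time_on_social == -999, na.rm = TRUE)
```

Keeping the cleaned data in a new object, as the tutorial does with social_media_NA, leaves the original import untouched in case you need to compare the two.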
Now let’s look at some data! We’re going to start by visualising the
time_on_social variable. Visualising helps us understand
more about the distribution of the data, which in turn tells us what
kinds of analysis we can perform.
To do this we will need to use the ggplot() function.
This is the main function from the ggplot2 package (you
should know what this is from reading the documentation).
ggplot() provides the canvas of the graph you want to
make.
To make the basic canvas ggplot() requires two
things:
1. The dataset that contains the variables.
2. The variables to go on the x and y axes.
Importantly, ggplot() only provides the canvas. It does
not draw anything by itself. You have to add layers to the canvas
created by ggplot() by using other functions that can
create bars, points or lines!
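The difference between the canvas and a layer can be sketched with a small made-up data frame. The object names toy, canvas and with_points are hypothetical, and geom_point() is used only as one example of a layer.

```r
library(ggplot2)

# Made-up data for illustration
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))

# The canvas alone: axes are set up, but nothing is drawn
canvas <- ggplot(toy, aes(x = x, y = y))

# Adding a layer with + draws points on the canvas
with_points <- canvas + geom_point()

# The number of layers stored in each plot object
length(canvas$layers)
length(with_points$layers)
```

Printing canvas shows an empty plot with axes, while printing with_points shows the same axes with four points drawn on them.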
Here we use geom_boxplot() - can you guess what this
does ?
social_media_NA %>%
  ggplot(aes(y = time_on_social)) + #ggplot uses aesthetics (aes()) to map variables to axes.
  scale_x_discrete() + #this tells ggplot that the x-axis is categorical.
  geom_boxplot() + #creates a boxplot
  labs(y = "Time on Social Media") #short for "labels", used to label axes and titles.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Notice that we get a warning. This is because ggplot()
recognises the NA values and removes them before drawing the plot. Be careful, as
not all functions in R do this automatically: some, such as mean(), return NA
unless you tell them to remove missing values with na.rm = TRUE.
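As a small illustration of how base R functions treat NA, the sketch below uses made-up values. mean() and sum() are base R functions, and na.rm = TRUE is their documented argument for dropping missing values.

```r
# Made-up values with one missing response
hours <- c(2.5, 3.0, NA, 1.5)

# By default the result is NA, because one value is unknown
mean(hours)

# na.rm = TRUE removes the missing value before calculating
mean(hours, na.rm = TRUE)
sum(hours, na.rm = TRUE) # 7
```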
What approximately is the median value? The lower quartile? The upper quartile? Is there another way that we could get this information in a more exact form ? Discuss this with your deskmate and your tutor.
ggplot() can be customised with many more functions
than we have shown here to make truly beautiful
plots. We will be learning how to do this over the
next few weeks.
For now let’s see if you can put some of the skills you have learned
so far to good use. See if you can work out how to make a histogram of
the data using the function geom_histogram() (Hint: You
will only need to provide a y variable this time!).
What conclusions would you draw about the shape of the data, given your histogram? Please discuss with your deskmate and tutor.
# Your code goes here!
Well done! You have completed everything you need to for this week. If you have finished in record time, please consult with your tutor about what to do next. Otherwise, we will see you next lab!
Fig 1.3.1: Students reaction to this information !